class: center, middle, inverse, title-slide # Introduction to ggplot2 ### Antoine & Nicolas ### cynkra GmbH ### February 24th, 2022 --- <style type="text/css"> .remark-code { font-size: 12px; } .font17 { font-size: 17px; } .font14 { font-size: 14px; } </style> # Exercises! Let's wrap everything up ! Analyze gapminder data ```r install.packages("gapminder") library(gapminder) gapminder ``` ``` ## # A tibble: 1,704 × 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## # … with 1,700 more rows ``` --- # Video introduction to the dataset <iframe width="709" height="399" src="https://www.youtube.com/embed/jbkSRLYSojo?t=29" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --- # Can we reproduce this ?  --- # Research questions * How did life expectancy change in the course of the last decades? Did id change differently between the continents? * How does life expectancy differs today between the continents? * Is life expectancy related to GDP? If so, to what degree (and form)? Is this assocication moderated by continent? --- # Report * We will work directly in a Rmd report as shown in previous part * We give some steps to follow and your mission is to arrange those into a nice report * We encourage you to take notes directly in the report --- # America's poorer countries * Use `summary(gapminder)` to get an overview of the data * Filter the data for the Americas in 2007 * Create the variable gdp, defined as the product of population size and gdp per person. * select `country`, `lifeExp`, `gdp` and `gdpPercap` * Keep the 5 countries with the lowest gdp --- # America's poorer countries * Use the pipe to combine these operations * save the new data in a variable * print it * plot it --- # America's poorer countries ```r library(tidyverse) rich_americas_2007 <- gapminder %>% filter(continent == "Americas", year == 2007) %>% mutate(gdp = gdpPercap * pop) %>% select(country, lifeExp, gdp, gdpPercap) %>% arrange(gdp) %>% slice_head(n = 5) rich_americas_2007 ``` ``` ## # A tibble: 5 × 4 ## country lifeExp gdp gdpPercap ## <fct> <dbl> <dbl> <dbl> ## 1 Haiti 60.9 10217297216. 1202. ## 2 Nicaragua 72.9 15603375235. 2749. ## 3 Trinidad and Tobago 69.8 19027934931. 18009. ## 4 Jamaica 72.6 20353013485. 7321. ## # … with 1 more row ``` --- # America's poorer countries ```r library(tidyverse) ggplot(rich_americas_2007, aes(gdpPercap, lifeExp, color = country)) + geom_point(size = 10) + scale_y_continuous(limits = c(0, max(rich_americas_2007$lifeExp))) ``` <!-- --> --- # Comparing continents * Do the same but using averages by continent * Do the same but using averages weighted by population * place both plots side by side using {patchwork} * use `+ plot_layout(guides = 'collect')` to collect legends * use `+ plot_annotation()` to add a global title ) --- # Comparing continents ```r continents <- gapminder %>% group_by(continent) %>% summarize( lifeExp_avg = mean(lifeExp), lifeExp_wavg = sum(pop*lifeExp) / sum(pop), gdpPercap_avg = mean(gdpPercap), gdpPercap_wavg = sum(pop*gdpPercap) / sum(pop), pop = sum(pop), .groups = "drop") ``` --- # Comparing continents ```r p1 <- ggplot(continents, aes(lifeExp_avg, gdpPercap_avg, size = pop, color = continent)) + geom_point() + labs(title = "simple average") p2 <- ggplot(continents, aes(lifeExp_wavg, gdpPercap_wavg, size = pop, color = continent)) + geom_point() + labs(title = "weighted average") library(patchwork) p1 + p2 + plot_layout(guides = 'collect') + plot_annotation("Averages can't be trusted!") ``` <!-- --> --- # Comparing continents This looks like a chart that should be faceted doesn't it ? We'll need to transform it in tidy form for that. We want one observation per row, where one observation translates to one dot on the chart. ```r continents ``` ``` ## # A tibble: 5 × 6 ## continent lifeExp_avg lifeExp_wavg gdpPercap_avg gdpPercap_wavg pop ## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Africa 48.9 50.6 2194. 2108. 6187585961 ## 2 Americas 64.7 69.5 7136. 15477. 7351438499 ## 3 Asia 60.1 61.1 7902. 2950. 30507333901 ## 4 Europe 71.9 72.3 14469. 15693. 6181115304 ## # … with 1 more row ``` --- # Comparing continents This is not trivial! but will sometimes be useful ```r continents_long <- continents %>% pivot_longer( cols = lifeExp_avg:gdpPercap_wavg, names_pattern = "(.*)_(.*)", names_to = c("var", "type"), values_to = "value") continents_long ``` ``` ## # A tibble: 20 × 5 ## continent pop var type value ## <fct> <dbl> <chr> <chr> <dbl> ## 1 Africa 6187585961 lifeExp avg 48.9 ## 2 Africa 6187585961 lifeExp wavg 50.6 ## 3 Africa 6187585961 gdpPercap avg 2194. ## 4 Africa 6187585961 gdpPercap wavg 2108. ## # … with 16 more rows ``` --- # Comparing continents This is not over because we want lifeExp and gdpPercap in separate columns ```r continents_tidy <- continents_long %>% pivot_wider(names_from = var, values_from = value) continents_tidy ``` ``` ## # A tibble: 10 × 5 ## continent pop type lifeExp gdpPercap ## <fct> <dbl> <chr> <dbl> <dbl> ## 1 Africa 6187585961 avg 48.9 2194. ## 2 Africa 6187585961 wavg 50.6 2108. ## 3 Americas 7351438499 avg 64.7 7136. ## 4 Americas 7351438499 wavg 69.5 15477. ## # … with 6 more rows ``` --- # Comparing continents Now that it's tidy we can take our previous code and tweak it slightly : ```r ggplot(continents_tidy, aes(lifeExp, gdpPercap, size = pop, color = continent)) + geom_point() + labs(title = "Averages can't be trusted!") + facet_wrap(vars(type)) ``` <!-- --> --- # Reproducing the plot : stage 1 Start again from the raw data * plot the 2007 data, with color mapped to `continent` and size mapped to `pop` * Use `alpha` in `geom_point()` to handle overplotting better * Use a logarithmic scale for x (use `scale_x_log10()`) * Set the colors manually using `scale_color_manual()` check the example at the bottom of `?scale_color_manual` (use a named vector). Use Europe orange, Asia red, Africa = blue, Americas yellow, Oceania green (Oceania is absent from picture, where Middle East has its own category, so we're a bit different) You can improve your colors using: https://r-charts.com/colors/ --- # Reproducing the plot * plot the 2007 data, with color mapped to `continent` and size mapped to `pop` .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop ) ) + geom_point() ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * plot the 2007 data, with color mapped to `continent` and size mapped to `pop` .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop ) ) + geom_point(alpha = .4) ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * Use a logarithmic scale for x (use `scale_x_log10()`) .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop ) ) + geom_point(alpha = .4) + scale_x_log10() ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * Set the colors manually .pull-left[ ```r cols <- c( Europe = "brown", Asia = "red", Africa = "blue", Americas = "orange", Oceania = "green") gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop ) ) + geom_point(alpha = .4) + scale_x_log10() + scale_colour_manual(values =cols) ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot : stage 2 * Set the labels using `labs()` and remove the legend with `legend(position = "none")` * Zoom on a similar window using `coord_cartesian` * Set specific breaks on the x axis using the `breaks` arg in `scale_x_log10()` * Set specific breaks on the y axis using the `breaks` arg in `scale_y_continuous()` --- # Reproducing the plot * Set the labels using `labs()` and remove the legend with `theme(legend.position = "none")` .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop) ) + geom_point(alpha = .4) + scale_x_log10() + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + theme(legend.position = "none") ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * Zoom on a similar window using `coord_cartesian` .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop) ) + geom_point(alpha = .4) + scale_x_log10() + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + theme(legend.position = "none") + coord_cartesian(ylim = c(22, 85)) ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * Set specific breaks on the x axis using the `breaks` arg in `scale_x_log10()` .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop) ) + geom_point(alpha = .4) + scale_x_log10(breaks = c(400, 4000, 40000)) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + theme(legend.position = "none") + coord_cartesian(ylim = c(22, 85)) ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * Set specific breaks on the y axis using the `breaks` arg in `scale_y_continuous()` .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop) ) + geom_point(alpha = .4) + scale_x_log10(breaks = c(400, 4000, 40000)) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + theme(legend.position = "none") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot : stage 3 * Use a nice default theme among the `theme_*()` functions of ggplot and increase font size * Use `axis.text = element_text(size =)` in `theme()` to increase the size of the axis text, do the same for `axis.title` * Increase the size of the bubbles, using `scale_size(range=)` * Use the `label` argument in `scale_x_log10()` to set labels in dollars --- # Reproducing the plot .pull-left[ ```r p <- gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop) ) + geom_point(alpha = .4) + scale_x_log10(breaks = c(400, 4000, 40000)) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + theme(legend.position = "none") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) + theme_bw() p ``` <!-- --> ] .pull-right[ * Use a nice default theme among the `theme_*()` functions of ggplot  Legends are back! Default themes override `theme()` so we must move `theme()` to the end ] --- # Reproducing the plot .pull-left[ ```r p <- gapminder %>% filter(year == 2007) %>% ggplot( aes(gdpPercap, lifeExp, color = continent, size = pop) ) + geom_point(alpha = .4) + scale_x_log10(breaks = c(400, 4000, 40000)) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) + theme_bw() + theme(legend.position = "none") p ``` <!-- --> ] .pull-right[ * Use a nice default theme among the `theme_*()` functions of ggplot  ] --- # Reproducing the plot .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot(aes(gdpPercap, lifeExp, color = continent, size = pop)) + geom_point(alpha = .4) + scale_x_log10(breaks = c(400, 4000, 40000)) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) + theme_bw() + theme(legend.position = "none", axis.text = element_text(size = 25), axis.title = element_text(size = 25)) ``` <!-- --> ] .pull-right[ * Use `axis.text = element_text(size =)` in `theme()` to increase the size of the axis text, do the same for `axis.title`  ] --- # Reproducing the plot .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot(aes(gdpPercap, lifeExp, color = continent, size = pop)) + geom_point(alpha = .4) + scale_x_log10(breaks = c(400, 4000, 40000)) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) + theme_bw() + theme(legend.position = "none", axis.text = element_text(size = 25), axis.title = element_text(size = 25)) ``` <!-- --> ] .pull-right[ * Use `axis.text = element_text(size =)` in `theme()` to increase the size of the axis text, do the same for `axis.title`  ] --- # Reproducing the plot .pull-left[ ```r gapminder %>% filter(year == 2007) %>% ggplot(aes(gdpPercap, lifeExp, color = continent, size = pop)) + geom_point(alpha = .4) + scale_x_log10(breaks = c(400, 4000, 40000)) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) + theme_bw() + theme(legend.position = "none", axis.text = element_text(size = 25), axis.title = element_text(size = 25)) + scale_size(range = c(1, 20)) ``` <!-- --> ] .pull-right[ * Increase the size of the bubbles, using `scale_size(range=)`  ] --- # Reproducing the plot .pull-left[ ```r p <- gapminder %>% filter(year == 2007) %>% ggplot(aes(gdpPercap, lifeExp, color = continent, size = pop)) + geom_point(alpha = .4) + scale_x_log10( breaks = c(400, 4000, 40000), labels = ~paste0("$",format( .x, big.mark = " ", trim = TRUE))) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) + theme_bw() + theme(legend.position = "none", axis.text = element_text(size = 25), axis.title = element_text(size = 25)) + scale_size(range = c(1, 20)) p ``` <!-- --> ] .pull-right[ * Use the `label` argument in `scale_x_log10()` to set labels in dollars  ] --- # Reproducing the plot : stage 4 * Build a dataset containing only the data for Congo ("Congo, Dem. Rep."), Ghana and South Africa * Use `geom_label()` to print the label on top of the chart * Use `geom_point()` to draw a big circle for these 3 countries, use `shape = 14`, it looks like a regular dot but it has a `fill` and a `color`, set both to constant values --- # Reproducing the plot * Build a dataset containing only the data for Congo ("Congo, Dem. Rep."), Ghana and South Africa ```r gapminder_subset <- gapminder %>% filter( year == 2007, country %in% c("Congo, Dem. Rep.", "Ghana", "South Africa")) %>% mutate(country = if_else(country == "Congo, Dem. Rep.", "Congo", as.character(country))) gapminder_subset ``` ``` ## # A tibble: 3 × 6 ## country continent year lifeExp pop gdpPercap ## <chr> <fct> <int> <dbl> <int> <dbl> ## 1 Congo Africa 2007 46.5 64606759 278. ## 2 Ghana Africa 2007 60.0 22873338 1328. ## 3 South Africa Africa 2007 49.3 43997828 9270. ``` --- * Use `geom_label()` to print the label on top of the chart .pull-left[ ```r p + geom_label(aes(label = country, size = NULL), data = gapminder_subset) ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * Use `geom_point()` to draw a big circle for these 3 countries, use `shape = 14`, it looks like a regular dot but it has a `fill` and a `color`, set both to constant values .pull-left[ ```r p + geom_point( data = gapminder_subset, size = 4, shape = 21, stroke = 2, fill = "white", color = "blue") ``` <!-- --> ] .pull-right[  ] --- # Reproducing the plot * and place the labels at a constant height below these points .pull-left[ ```r p + geom_point( data = gapminder_subset, size = 4, shape = 21, stroke = 2, fill = "white", color = "blue") + geom_label( aes(label = country, size = NULL), data = gapminder_subset, y = 35 ) ``` <!-- --> ] .pull-right[  ] --- # We could go further! What do we miss ? * labels inside the axes * use annotate("text", ...) * Specific fonts and color * use `theme()` * background image * ggpubr::background_image Almost anything is possible, Google is your friend! --- # Animation we rebuild `p` but without filtering on year and we add a label on top with the year ```r p2 <- gapminder %>% # filter(year == 2007) %>% ggplot(aes(gdpPercap, lifeExp, color = continent, size = pop)) + geom_point(alpha = .4) + scale_x_log10( breaks = c(400, 4000, 40000), labels = ~paste0("$",format( .x, big.mark = " ", trim = TRUE))) + scale_colour_manual(values = cols) + labs(x = "income", y = "lifespan") + coord_cartesian(ylim = c(22, 85)) + scale_y_continuous(breaks = c(25, 50, 75)) + theme_bw() + theme(legend.position = "none", axis.text = element_text(size = 25), axis.title = element_text(size = 25)) + scale_size(range = c(1, 20)) ``` --- # Animation We add a label on top and set up animation ```r library(gganimate) p_anim <- p2 + geom_text( aes(label = as.character(year), x = 4000, y = 25), color = "lightgrey", size = 32, hjust = 0, vjust = 0) + transition_time(year) + ease_aes('linear') animate(p_anim, nframes = 200, fps = 10) ``` <!-- --> --- # Animation It's as trivial to change an animation as it is to change a plot ```r p_anim <- p2 + geom_text( aes(label = as.character(year), x = 4000, y = 25), color = "lightgrey", size = 10, hjust = 0, vjust = 0) + facet_wrap(vars(continent)) + theme_minimal() + transition_time(year) + ease_aes('linear') animate(p_anim, nframes = 200, fps = 10) ``` <!-- -->